@vaclisinc vaclisinc commented Nov 16, 2025

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env!

⚠️ IMPORTANT: First-time index building may take 2-3 minutes. In the future, index building will trigger automatically when the datapuller runs.
Until then, run the command below (adjusting the year and semester as needed) to build the index first!

curl -X POST http://localhost:8000/refresh \
       -H 'Content-Type: application/json' \
       -d '{"year": 2026, "semester": "Spring"}' | jq

System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller --> GraphQLResolvers --> |TODO:|Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: “Memory models in concurrent programming”
→ Should return courses like databases, operating systems, etc.
→ Should not return biology or psychology courses just because of the word “memory.”


Input: “how to shot a hot vlog”


Implementation Details

Python Semantic Search Service (FastAPI)

  • FastAPI microservice (apps/semantic-search) that:

    • Uses BGE (BAAI/bge-base-en-v1.5) embedding model optimized for retrieval tasks
    • Builds term-specific embeddings + FAISS indices from GraphQL catalog data
    • Implements threshold-based filtering (returns all results above similarity threshold, not just top-k)
    • Searches top 500 candidates for performance, then filters by threshold (default: 0.45)
  • Key endpoints:

    • /health — readiness probe showing index status
    • /refresh — rebuild FAISS index for a given year/semester
    • /search — semantic query with threshold filtering
  • Model Architecture:

    • Uses instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
    • Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
    • FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
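
As a rough illustration of the text formats and the cosine-via-inner-product idea listed above, here is a minimal stdlib-only Python sketch. The helper names (`format_course_text`, `format_query`, `l2_normalize`, `inner_product`) are hypothetical and not taken from main.py, and the toy 2-d vectors stand in for real BGE embeddings; the actual service relies on FAISS IndexFlatIP rather than this hand-rolled arithmetic.

```python
import math

# Instruction prefix quoted in this PR for BGE queries.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_course_text(subj, num, title, desc):
    """Build the text embedded for one course (format quoted in this PR)."""
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"


def format_query(query):
    """Prepend the retrieval instruction to a user query."""
    return QUERY_PREFIX + query


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors (an assumption for illustration, not real embeddings).
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# For unit-length vectors the inner product equals the cosine similarity.
cosine = inner_product(a, b)
```

Because both vectors are normalized before the dot product, an inner-product index scores pairs by cosine similarity, which is why the normalization step matters.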

Example: manually refreshing an index

curl -X POST http://localhost:8000/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8000/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}
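
For callers that prefer Python over curl, the same request could be sketched as follows. This assumes the service is reachable at the `localhost:8000` address used in the curl examples; `build_search_url` and `search` are hypothetical helper names, not part of the service.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples


def build_search_url(query, year, semester, threshold=0.45, base_url=BASE_URL):
    """Build the /search URL with the query parameters shown in the curl example."""
    params = {"query": query, "year": year, "semester": semester, "threshold": threshold}
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{base_url}/search?{urlencode(params, quote_via=quote)}"


def search(query, year, semester, threshold=0.45):
    """Call /search and return the decoded JSON body (needs a running service)."""
    with urlopen(build_search_url(query, year, semester, threshold), timeout=10) as resp:
        return json.load(resp)
```

With the service running, `search("deep reinforcement learning", 2026, "Spring")` would return a dictionary shaped like the response above.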

Backend Integration (Node / Express)

  • Added SEMANTIC_SEARCH_URL environment variable pointing to Python service

  • Implemented lightweight proxy endpoint /api/semantic-search/courses:

    • Forwards requests to Python service
    • Returns only {subject, courseNumber, score} for efficient frontend filtering
    • Frontend maintains API response order (sorted by semantic similarity)
  • Updated GraphQL behavior:

    • Introduced hasCatalogData field for term filtering
    • Updated resolver to use terms(withCatalogData: true)
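
The trimming done by the lightweight endpoint can be pictured with a small sketch. The real proxy lives in the Node backend; this Python version only illustrates the shape of the transformation, and `slim_results` is a hypothetical name.

```python
def slim_results(service_response):
    """Keep only the fields the frontend needs, preserving the service's order."""
    return [
        {
            "subject": result["subject"],
            "courseNumber": result["courseNumber"],
            "score": result["score"],
        }
        for result in service_response.get("results", [])
    ]


# Sample shaped like the /search response shown earlier in this PR.
sample = {
    "query": "deep reinforcement learning",
    "threshold": 0.45,
    "results": [
        {
            "subject": "COMPSCI",
            "courseNumber": "285",
            "score": 0.713,
            "title": "Deep Reinforcement Learning, Decision Making, and Control",
        }
    ],
}
slim = slim_results(sample)
```

Dropping the title and other fields keeps the payload small, since the frontend already holds the full course objects.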

Frontend (Catalog UI)

  • AI Search toggle (✨ sparkle button) to activate semantic search mode
  • Semantic results preserve backend ordering (by similarity score)
  • Frontend maps semantic results to full course objects for display
  • Graceful fallback to fuzzy search when semantic search unavailable
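
The mapping and fallback behavior described above could look roughly like this. The frontend itself is not written in Python; this is only a sketch of the logic, with `resolve_courses` and its parameters as hypothetical names and `None` used here as an assumed signal that the semantic service was unavailable.

```python
def resolve_courses(semantic_results, catalog, fuzzy_search, query):
    """Map slim semantic hits to full course objects, keeping similarity order.

    Falls back to fuzzy_search(query) when semantic_results is None.
    """
    if semantic_results is None:
        return fuzzy_search(query)
    # Index the catalog once so each hit is an O(1) lookup.
    by_key = {(c["subject"], c["courseNumber"]): c for c in catalog}
    return [
        by_key[(r["subject"], r["courseNumber"])]
        for r in semantic_results
        if (r["subject"], r["courseNumber"]) in by_key
    ]
```

Iterating over the semantic results, rather than over the catalog, is what preserves the backend's ordering by similarity score.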

Technical Decisions

Why BGE over other models?

  • BGE (BAAI General Embedding) is specifically optimized for retrieval tasks
  • Better semantic understanding than general-purpose models (all-MiniLM, mpnet)
  • Supports instruction prefixes for improved query understanding
  • 109M parameters - good balance of accuracy and speed

Why threshold instead of top-k?

  • Threshold-based filtering returns every result above the similarity threshold rather than an arbitrary top-k cutoff
  • More flexible - can return 5 results for specific queries, 50 for broad queries
  • Similarity score threshold (0.45) ensures quality over quantity
  • Searches top 500 candidates for performance, then applies threshold
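
The two-stage selection described above (cap the candidates, then filter by threshold) can be sketched in plain Python. `threshold_search` is a hypothetical name, the strict `>` comparison follows the comment in the search example, and only the COMPSCI 285 score comes from this PR; the other entries are made-up placeholders.

```python
def threshold_search(scored, threshold=0.45, max_candidates=500):
    """Return (course, score) pairs above the threshold, best first.

    `scored` is any iterable of (course_id, similarity) pairs.
    """
    # Stage 1: keep only the best max_candidates by similarity.
    candidates = sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_candidates]
    # Stage 2: drop candidates at or below the similarity threshold.
    return [(course, score) for course, score in candidates if score > threshold]


scored = [("COURSE B", 0.52), ("COMPSCI 285", 0.713), ("COURSE C", 0.31)]
hits = threshold_search(scored)
```

The number of results then depends on the query: a narrow query may clear the threshold for only a handful of courses, while a broad one may return many, up to the candidate cap.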

Model Options Available (hardcoded in main.py)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Next Steps

  1. Datapuller Integration: TOP PRIORITY!
    Automatically trigger /refresh endpoint when new catalog data is pulled

  2. Fine-tuning for Berkeley Courses
    Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search

  3. Query Expansion
    Handle abbreviations (NLP → Natural Language Processing) and synonyms
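
One possible shape for the abbreviation handling, purely as a sketch of future work: a lookup table applied to whole words before the query is embedded. `expand_query` and `ABBREVIATIONS` are hypothetical names, and only the NLP entry comes from this PR.

```python
import re

# Only the NLP entry comes from this PR; a real table would need curation.
ABBREVIATIONS = {"NLP": "Natural Language Processing"}


def expand_query(query, abbreviations=ABBREVIATIONS):
    """Append the long form after each known abbreviation (whole words only)."""

    def replace(match):
        word = match.group(0)
        long_form = abbreviations.get(word.upper())
        return f"{word} ({long_form})" if long_form else word

    return re.sub(r"[A-Za-z]+", replace, query)
```

Keeping the original abbreviation next to its expansion means courses that mention either form can still match.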


Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND

@vaclisinc vaclisinc self-assigned this Nov 16, 2025
@vaclisinc vaclisinc marked this pull request as draft November 16, 2025 22:17
@vaclisinc vaclisinc marked this pull request as ready for review November 16, 2025 22:21
@vaclisinc vaclisinc force-pushed the feat/semantic-search-vaclis branch from 4fa576b to a4db7f3 Compare November 16, 2025 23:12
@maxmwang

Will review later, but this will need infra changes before being merged.

@vaclisinc vaclisinc force-pushed the feat/semantic-search-vaclis branch from a4db7f3 to 031bb9e Compare November 18, 2025 00:39
@vaclisinc

vaclisinc commented Nov 18, 2025

Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses. 🎯

I need your help building a small training dataset.

Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:

{
  "query": "planning about my career",
  "good_results": ["MBA 209P", "MCELLBI 295"],
  "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
  "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"] // optional
}

Where:

  • query = what you searched for (it can be a full sentence, not just keywords)
  • good_results = courses from the search results that ARE relevant (should rank high)
  • bad_results = courses from the search results that feel clearly NOT related
  • missing_courses (optional) = courses you strongly expected to see but that did NOT show up at all
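
To make the format concrete, submitted examples could be checked with something like the sketch below before being added to the dataset. `validate_feedback` is a hypothetical helper, and the rules it enforces are an interpretation of the field descriptions above. Note that the `// optional` comment in the sample is not valid JSON and would need to be removed before parsing.

```python
def validate_feedback(example):
    """Return a list of problems; an empty list means the example looks usable."""
    problems = []
    query = example.get("query")
    if not isinstance(query, str) or not query.strip():
        problems.append("query must be a non-empty string")
    for key in ("good_results", "bad_results"):
        if not isinstance(example.get(key), list):
            problems.append(f"{key} must be a list of course codes")
    # missing_courses is optional, but must be a list when present.
    if not isinstance(example.get("missing_courses", []), list):
        problems.append("missing_courses must be a list when present")
    good, bad = example.get("good_results"), example.get("bad_results")
    if isinstance(good, list) and isinstance(bad, list):
        overlap = sorted(set(good) & set(bad))
        if overlap:
            problems.append(f"courses marked both good and bad: {overlap}")
    return problems


example = {
    "query": "planning about my career",
    "good_results": ["MBA 209P", "MCELLBI 295"],
    "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
    "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"],
}
problems = validate_feedback(example)
```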

What I’m especially looking for:

  • Natural language queries like:
    • “planning about my career”
    • “I want to get into AI research from a non-CS background”
    • “I like math but hate proofs, what should I take?”
  • Any query where the results feel clearly off, noisy, or surprising

Goal: ~50–100 examples total.
Even 3–5 examples from you would be super helpful. 🙏

@vaclisinc vaclisinc changed the title Feat/Introduce semantic search! Feat/Introduce AI semantic search! Nov 18, 2025
vaclisinc and others added 8 commits December 2, 2025 19:01
* disable sections/lectures, scroll hide bug, event clear bug, calendar month bug, leftborder color bug

* fix: hardcode color for sidebar header

* fix: urgent bug of cannot adding a class which is not in primarySection (access _class before initialization)

* fix: minor format error

---------

Co-authored-by: vaclis.mbp <[email protected]>
* hasCatalogItem true by default

* classes datapuller populate terms

* Avoid N+1 enrollment fetches in getCatalog

* clean up
PineND and others added 26 commits December 2, 2025 19:01
Restores courseId, course.subject, and course.number fields to
GET_CANONICAL_CATALOG_QUERY that were removed in d0a37f1.

These fields are essential for cross-listed course functionality:

- courseId: Required by Class.course resolver (class/resolver.ts:112)
  to fetch the parent course record for cross-listed courses like
  DATA C100 / STAT C100 which share courseId but have different subjects

- course.subject & course.number: Required by the resolver override
  (class/resolver.ts:118-119) and grade lookup key generation
  (catalog/controller.ts:320) to ensure each cross-listed variant
  displays the correct department name and historical grade distribution

Without these fields:
- Cross-listed courses cannot be identified (no courseId linking)
- Grade distributions show incorrect data (wrong lookup keys)
- Course metadata displays wrong department (no subject override)

Database verification shows courseId '148047' links DATA C100 and
STAT C100 across multiple terms, confirming the need for these fields.

Note: termId/sessionId not restored as catalog controller pre-populates
enrollment data, making those fields unnecessary for this query.
…isting

The catalog controller was not overriding course.subject and course.number
with class-specific values, causing cross-listed courses to be indexed and
displayed with incorrect department names.

Issue:
- DATA C100 and STAT C100 share courseId '148047'
- The parent course record has subject: 'STAT', number: 'C100'
- When catalog fetches DATA C100, it was using the parent course's subject
- This caused getIndex() (line 33) to index DATA C100 as 'STAT C100'
- Search results would show wrong department for cross-listed variants

Fix:
Added override logic after formatCourse() to set:
- formattedCourse.subject = _class.subject (DATA, not STAT)
- formattedCourse.number = _class.courseNumber (C100)

This matches the behavior in Class.course resolver (class/resolver.ts:118-119)
and ensures consistency across all code paths that access course metadata.

Benefits:
- Search indexing uses correct subject for each cross-listed variant
- Course metadata displays correct department in all contexts
- Behavior now consistent between catalog controller and GraphQL resolver
* reassign color to fit timeframe

* dynamically determined time

* rename csv

* only sunrise/sunset, daytime, nightime

---------

Co-authored-by: maxmwang <[email protected]>
* feat: adding enrollment tab into catalog

* fix: minor format

* Make enrollment graph same as the Enrollment page

* linting

* styling improvements

---------

Co-authored-by: PineND <[email protected]>
* pill v1

* finished section

* small fixes

* Update apps/frontend/src/components/Class/Sections/Sections.module.scss

Co-authored-by: Copilot <[email protected]>

* copilot fixes

* accessible table

* location support

---------

Co-authored-by: Copilot <[email protected]>
@vaclisinc vaclisinc force-pushed the feat/semantic-search-vaclis branch from 3e0ecce to 94d9609 Compare December 3, 2025 03:10
@vaclisinc vaclisinc marked this pull request as draft December 3, 2025 03:12
@vaclisinc vaclisinc closed this Dec 3, 2025